Preview of Hadoop Security
Owen O’Malley, Yahoo Hadoop Development
[email_address]
Problem
- Primary goal: keep data in HDFS secure from unauthorized access!
- Corollary: all HDFS clients must be authenticated to ensure they are the user they claim to be.
- Since Map/Reduce runs applications as the user, it must authenticate users.
- Since servers (HDFS, Map/Reduce) are entrusted with user credentials, they must also be authenticated.
- Kerberos will be the underlying authentication system.
- It must be possible to configure security on or off.
Adding Security to a Large Project
Security Development Team
- Boris Shkolnik
- Devaraj Das
- Jakob Homan
- Owen O’Malley
- Kan Zhang
- Jitendra Nath Pandey
With paranoid assistance from: Ram Marti
Security Threats in Hadoop
- User-to-service authentication
  - No user authentication on the NameNode or JobTracker: client code supplies the user and group names.
  - No user authorization on the DataNode (fixed in 0.21): users can read or write any block.
  - No user authorization on the JobTracker: users can modify or kill other users’ jobs, and can modify the persistent state of the JobTracker.
- Service-to-service authentication
  - No authentication of DataNodes and TaskTrackers: users can start fake DataNodes and TaskTrackers.
- No encryption on the wire or on disk.
Definitions
- Authentication: ensuring the user is who they claim to be.
  - We currently do a very poor job of this.
  - We need it on both RPC and the Web UI.
- Authorization: ensuring the user can only do things that they are allowed to do.
  - HDFS already does this via owners, groups, and permissions.
  - Map/Reduce does not do this.
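HDFS authorization uses the familiar POSIX-style owner/group/other permission bits. A minimal sketch of how such a check works (illustrative only, not HDFS’s actual code; the function and parameter names are hypothetical):

```python
def permitted(user, user_groups, owner, group, mode, want):
    """POSIX-style permission check: select the owner, group, or
    'other' bits of mode, then test the requested access.
    want is one of 'r', 'w', 'x'. Illustrative sketch only."""
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == owner:
        perm = (mode >> 6) & 7      # owner bits
    elif group in user_groups:
        perm = (mode >> 3) & 7      # group bits
    else:
        perm = mode & 7             # other bits
    return bool(perm & bit)
```

For example, with mode 0o640 the owner can read, members of the file’s group can read but not write, and everyone else is denied.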
Using Kerberos and Single Sign-on
- Kerberos allows a user to sign in once to obtain a Ticket Granting Ticket (TGT):
  - kinit – get a new Kerberos ticket
  - klist – list your Kerberos tickets
  - kdestroy – destroy your Kerberos tickets
- TGTs last for 10 hours, renewable for 7 days by default.
- PAM on Linux and Solaris can automatically run kinit for you (it still needs your password).
- Once you have a TGT, Hadoop commands work as before:
  - hadoop fs -ls /
  - hadoop jar wordcount.jar in-dir out-dir
Kerberos Dataflow
API Changes
- Very minimal API changes!
- UserGroupInformation *completely* changed.
- MapReduce added authorization.
- Jobs now have a Credentials object that can store secrets (available from JobConf and JobContext).
- Jobs automatically get tokens for HDFS systems: the primary HDFS, File{In,Out}putFormat, and DistCp.
  - Additional ones can be set via mapreduce.job.hdfs-servers.
- Set ACLs via mapreduce.job.acl-{view,modify}-job.
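The job-level properties above would appear in a job configuration roughly as follows. This is a sketch in the standard Hadoop configuration format; the server addresses and user/group names are hypothetical placeholders:

```xml
<!-- Illustrative job configuration fragment; all values are placeholders. -->
<property>
  <name>mapreduce.job.hdfs-servers</name>
  <value>hdfs://nn1.example.com,hdfs://nn2.example.com</value>
</property>
<property>
  <name>mapreduce.job.acl-view-job</name>
  <value>alice,bob analysts</value>
</property>
<property>
  <name>mapreduce.job.acl-modify-job</name>
  <value>alice</value>
</property>
```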
Other MapReduce Security Changes MapReduce System directory was 777 but now 700. Tasks run as user instead of TaskTracker user. Task directories were globally visible and now 700. Distributed Cache is now secure Shared (original is world readable) is shared by everyone’s jobs. Private (original is not world readable) is shared by user’s jobs.
Web UIs
- Hadoop, and especially MapReduce, makes heavy use of the web UIs, so these need to be authenticated as well.
- We will make authentication pluggable, but include a login module that uses the Kerberos username and password.
- Even better would be a SPNEGO filter for Jetty that uses the Kerberos tickets from the browser.
- All of the servlets will use the authenticated username and enforce permissions appropriately.
Proxy-Users
- Some services must access HDFS and MapReduce as other users.
- HDFS and MapReduce allow configuration entries that define:
  - who the proxy service can impersonate, and
  - which hosts it can impersonate from.
- Example:
  - hadoop.proxyuser.superguy.groups=goodguys
  - hadoop.proxyuser.superguy.hosts=secretbase
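Written out as a configuration-file fragment, the two example entries above look like this (the key and value pairs are from the slide; the XML framing is the standard Hadoop configuration format):

```xml
<property>
  <name>hadoop.proxyuser.superguy.groups</name>
  <value>goodguys</value>
</property>
<property>
  <name>hadoop.proxyuser.superguy.hosts</name>
  <value>secretbase</value>
</property>
```

With these entries, the superguy service account may impersonate only members of the goodguys group, and only from the host secretbase.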
Remaining Security Issues
- We are not encrypting on the wire. It will be possible within the framework, but not in 0.22.
- We are not encrypting on disk, for either HDFS or MapReduce.
- Encryption is expensive in terms of CPU and I/O speed.
- Our current threat model is an attacker with access to a user account, but not root or physical access; they can’t sniff packets on the network.
